Main Questions

Questions 1 & 2

df %>%
  filter(Year == 1962) %>%
  ggplot(aes(y = co2PerCap, x = gdpPercap)) +
  theme_classic() +
  geom_point(color = "red") +
  labs(y = "CO2 emissions (metric tons per capita)", x = "GDP in purchasing power parity (USD per capita)") +
  ggtitle("GDP vs. CO2 emissions in 1962")

df %>%
  filter(Year == 1962) %>%
  ggplot(aes(y = co2PerCap, x = gdpPercap)) +
  theme_classic() +
  scale_y_log10() +
  scale_x_log10() +
  geom_point(color = "red") +
  ggtitle("log GDP vs. log CO2 emissions in 1962") +
  xlab("log GDP in purchasing power parity (USD per capita)") +
  ylab("log CO2 emissions (metric tons per capita)")

After visualizing the original data, we see a few very large values that lie far from the bulk of the observations, which cluster close together at the low end. It appears that as GDP per capita increases, CO2 emissions increase at a faster rate, up to a GDP per capita of about 200,000. We cannot determine whether the relationship between the x and y values is linear from the raw plot alone.

However, because both the x and y values span several orders of magnitude, we log-transform both GDP (x) and CO2 emissions (y).
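To illustrate why a log transform helps here, the following minimal sketch uses made-up numbers (the vector `gdp` is purely illustrative and is not taken from the dataset). On the raw scale the largest value dominates; on the log10 scale each tenfold increase becomes one equal step.

```r
# Hypothetical per-capita values spanning four orders of magnitude
gdp <- c(100, 1000, 10000, 100000, 1000000)

# On the raw scale the largest value dominates the range
range(gdp)

# log10 turns every tenfold increase into a step of size 1
log_gdp <- log10(gdp)
log_gdp
diff(log_gdp)
```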


Question 3

df <- df %>%
  mutate(logCO2 = log10(co2PerCap), logGDP = log10(gdpPercap)) # use the same base (10) for both variables

mod <- cor.test(x = df$logCO2, y = df$logGDP) %>% tidy()
mod
## # A tibble: 1 × 8
##   estimate statistic p.value parameter conf.low conf.high method     alternative
##      <dbl>     <dbl>   <dbl>     <int>    <dbl>     <dbl> <chr>      <chr>      
## 1    0.902      71.8       0      1182    0.891     0.912 Pearson's… two.sided

Pearson’s correlation coefficient measures the strength of the linear relationship between two variables. Log GDP is strongly and positively associated with log CO2, with r = 0.90 (95% CI 0.89 to 0.91).
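One property worth keeping in mind is that Pearson's r is unchanged by a positive linear rescaling of either variable. Since a natural log is just log10 multiplied by a constant, the base of the logarithm does not affect r. The sketch below checks this on simulated data; `x` and `y` are hypothetical stand-ins, not the GDP and CO2 columns.

```r
set.seed(1)

# Simulated positive, right-skewed data (hypothetical stand-ins for GDP and CO2)
x <- rlnorm(200, meanlog = 8, sdlog = 1)
y <- x^0.9 * rlnorm(200, meanlog = 0, sdlog = 0.5)

r_log10 <- cor(log10(x), log10(y)) # base 10 on both sides
r_mixed <- cor(log10(x), log(y))   # natural log on one side
r_raw   <- cor(x, y)               # untransformed, for comparison

# r_log10 and r_mixed agree; r_raw is computed on a different scale
c(log10 = r_log10, mixed = r_mixed, raw = r_raw)
```

The raw-scale correlation is shown only for contrast: the log transform itself is not linear, so it can change r, whereas the choice of base cannot.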


Question 4

res <- df %>%
  group_by(Year) %>%
  summarise(
    tidy(
      cor.test(x = co2PerCap, y = gdpPercap, method = "kendall")
    )
  ) %>%
  dplyr::slice_max(estimate, n = 1)
res
## # A tibble: 1 × 6
##    Year estimate statistic  p.value method                         alternative
##   <dbl>    <dbl>     <dbl>    <dbl> <chr>                          <chr>      
## 1  2002    0.780      12.9 4.37e-38 Kendall's rank correlation tau two.sided

Kendall’s tau between CO2 emissions and GDP per capita is highest in 2002, at tau = 0.78.
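Kendall's tau depends only on the ordering of the observations, so it gives the same value on the raw and on the log-transformed variables. This is why the untransformed columns can be used for this question. A minimal sketch on simulated data (`x` and `y` are hypothetical, not the dataset's columns):

```r
set.seed(2)

# Simulated positive data without ties (hypothetical stand-ins for GDP and CO2)
x <- rlnorm(100, meanlog = 8, sdlog = 1)
y <- x^0.9 * rlnorm(100, meanlog = 0, sdlog = 0.5)

# Rank-based correlation on the raw scale and on the log10 scale
tau_raw <- cor(x, y, method = "kendall")
tau_log <- cor(log10(x), log10(y), method = "kendall")

c(raw = tau_raw, log10 = tau_log)
```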

Question 5

fig <- df %>%
  filter(Year == res$Year) %>%
  plot_ly(
    x = ~logGDP,
    y = ~logCO2,
    size = ~pop,
    color = ~continent,
    # frame = ~Year,
    text = ~`Country Name`,
    hoverinfo = "text",
    type = "scatter",
    mode = "markers"
  )

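# logGDP is already log-transformed, so the x-axis is kept linear;
# type = "log" would apply a second log transform on top of it.
fig <- fig %>% layout(
  xaxis = list(
    type = "linear"
  )
)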
fig <- fig %>% animation_opts(
  1000,
  easing = "elastic", redraw = FALSE
)


fig %>%
  layout(
    title = "log GDP vs. log CO2 emissions in 2002", plot_bgcolor = "#e5ecf6", xaxis = list(title = "log GDP"),
    yaxis = list(title = "log CO2 Emissions"), legend = list(title = list(text = "<b> Continent </b>"))
  )

The interactive plot above depicts the relationship between CO2 emissions and GDP per capita in 2002, the year in which the correlation between the two variables is highest, as shown in the previous question. Hovering over a dot displays the country name, and the dot size corresponds to that country's population.

More Questions

Question 1

What is the relationship between continent and ‘Energy use (kg of oil equivalent per capita)’?

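res <- df %>%
  filter(!is.na(continent)) %>%
  # formula interface: response ~ grouping variable; `data = .` receives the piped data frame
  kruskal.test(energyUsePerCap ~ continent, data = .) %>%
  tidy()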

We use the Kruskal-Wallis test because it is a non-parametric alternative to one-way ANOVA: it does not assume normally distributed residuals. The test works on two or more independent samples, which may have different sizes.

There is a significant relationship between continent and energy use: the p-value is very close to 0, well below the significance threshold, which we set at 0.05.
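As a minimal sketch of how the test is called and read, the example below runs `kruskal.test()` through its formula interface on simulated data. The data frame `toy` and its columns `group` and `value` are hypothetical; the three groups deliberately have different sizes and skewed values.

```r
set.seed(3)

# Simulated data: three groups of unequal size with skewed (non-normal) values
toy <- data.frame(
  group = rep(c("A", "B", "C"), times = c(20, 30, 25)),
  value = c(rexp(20, rate = 1), rexp(30, rate = 0.5), rexp(25, rate = 0.2))
)

# response ~ grouping variable
kw <- kruskal.test(value ~ group, data = toy)

kw$statistic # Kruskal-Wallis chi-squared
kw$parameter # degrees of freedom = number of groups - 1
kw$p.value   # compare against the chosen significance threshold
```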

Question 2

Is there a significant difference between Europe and Asia with respect to ‘Imports of goods and services (% of GDP)’ in the years after 1990?

mod <- df %>%
  filter(continent %in% c("Asia", "Europe"), Year > 1990) %>%
  glm(importPercentageGDP ~ continent, data = .) %>%
  tidy()

mod
## # A tibble: 2 × 5
##   term            estimate std.error statistic  p.value
##   <chr>              <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)        46.8       2.61     17.9  3.36e-44
## 2 continentEurope    -5.06      3.56     -1.42 1.58e- 1

While there are many candidate statistical tests for comparing a variable of interest between two groups, we choose a simple linear regression because:

It tells us to what extent the regressor (continent) affects the regressand (imports of goods and services as % of GDP).

\[ Y_i = \beta_0 + \beta_1 \, \text{continent}_i + \epsilon_i \]

The null hypothesis is \(\beta_{1} = 0\), where the variable continent equals 1 for Europe and 0 for Asia.

We fit a linear regression model to compare the two groups. There is no significant difference between Europe and Asia with respect to imports of goods and services as a percentage of GDP (p = 0.158, which is above the 0.05 threshold).

A t-test would also have answered the question above; linear regression has the additional advantage of telling us how much a change from Asia (0) to Europe (1) shifts the outcome variable (imports of goods and services as % of GDP). This is given by the estimated coefficient, -5.06: Europe's mean import share is about 5.06 percentage points lower than Asia's.
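The link between the two approaches can be checked directly: with one binary regressor, the t statistic and p-value of the slope match those of a two-sample t-test that assumes equal variances. The sketch below uses simulated data; the data frame `toy` and its values are hypothetical, not the dataset's.

```r
set.seed(4)

# Simulated import shares for two groups (hypothetical values)
toy <- data.frame(
  continent = rep(c("Asia", "Europe"), each = 40),
  imports   = c(rnorm(40, mean = 45, sd = 20), rnorm(40, mean = 40, sd = 20))
)

# Linear regression: "Asia" is the reference level, the slope is Europe minus Asia
fit <- lm(imports ~ continent, data = toy)
slope_t <- coef(summary(fit))["continentEurope", "t value"]
slope_p <- coef(summary(fit))["continentEurope", "Pr(>|t|)"]

# Pooled-variance t-test; it reports Asia minus Europe, so the sign is flipped
tt <- t.test(imports ~ continent, data = toy, var.equal = TRUE)

c(lm_t = slope_t, ttest_t = unname(tt$statistic))
c(lm_p = slope_p, ttest_p = tt$p.value)
```

Note that `t.test()` defaults to the Welch test (`var.equal = FALSE`), which will generally give slightly different numbers from the regression.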


Question 3

What is the country (or countries) that has the highest ‘Population density (people per sq. km of land area)’ across all years? (i.e., which country has the highest average ranking in this category across each time point in the dataset?)

df %>%
  select(Year, `Country Name`, popDensityPerSqKm) %>%
  arrange(Year, desc(popDensityPerSqKm)) %>%
  group_by(Year) %>%
  slice(1:3) %>%
  ggplot(data = ., aes(x = as.factor(Year), y = popDensityPerSqKm, fill = as.factor(`Country Name`))) +
  geom_bar(position = "dodge", stat = "identity") +
  theme_classic() +
  labs(x = "Year", y = "population density (per sq.km)", fill = "Country") +
  ggtitle("Population density in the top 3 highest density countries in Years 1962-2007")

res <- df %>%
  select(Year, `Country Name`, popDensityPerSqKm) %>%
  arrange(Year, desc(popDensityPerSqKm)) %>%
  group_by(Year) %>%
  slice(1:3) %>%
  mutate(
    rnks = row_number(desc(popDensityPerSqKm))
  ) %>%
  group_by(`Country Name`) %>%
  summarize(mean.rank = mean(rnks))
res
## # A tibble: 4 × 2
##   `Country Name`       mean.rank
##   <chr>                    <dbl>
## 1 Hong Kong SAR, China       3  
## 2 Macao SAR, China           1.5
## 3 Monaco                     1.5
## 4 Singapore                  3

The highest-ranked country in terms of population density changes across the years, as we can tell from the graph above.

To find out which country has the highest average ranking, we average each country's within-year rank (1 = most densely populated) across the years. Macao SAR, China and Monaco are tied for first place with an average rank of 1.5, while Hong Kong SAR, China and Singapore each have an average rank of 3. Note that ranks are computed only among each year's top three countries, so a country's average covers only the years in which it appears there.
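An alternative that answers the question for every country, not only those that reach the top three, is to rank all countries within each year before averaging. The sketch below shows the idea on toy data; the tibble `toy` and its columns `country` and `density` are hypothetical names, and `n_years` is added so that averages based on few years are easy to spot.

```r
library(dplyr)

# Toy data: three hypothetical countries observed in two years
toy <- tibble(
  Year    = rep(c(1962, 2007), each = 3),
  country = rep(c("A", "B", "C"), times = 2),
  density = c(300, 200, 100, 250, 400, 50)
)

avg_rank <- toy %>%
  group_by(Year) %>%
  mutate(rnk = min_rank(desc(density))) %>% # rank 1 = densest in that year
  group_by(country) %>%
  summarise(mean_rank = mean(rnk), n_years = n()) %>%
  arrange(mean_rank)

avg_rank
```

In this toy example A and B swap the top two places between the two years and therefore share the best average rank, while C is last in both years.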

Question 4

What country (or countries) has shown the greatest increase in ‘Life expectancy at birth, total (years)’ since 1962?

res <- df %>%
  select(Year, `Country Name`, `Life expectancy at birth, total (years)`) %>%
  group_by(`Country Name`) %>%
  summarise(
    diff = `Life expectancy at birth, total (years)`[Year == 2007] - `Life expectancy at birth, total (years)`[Year == 1962],
    .groups = "drop"
  ) %>%
  dplyr::slice_max(diff, n = 5)

res
## # A tibble: 5 × 2
##   `Country Name`  diff
##   <chr>          <dbl>
## 1 Maldives        36.9
## 2 Bhutan          33.2
## 3 Timor-Leste     31.1
## 4 Tunisia         30.9
## 5 Oman            30.8

res %>%
  ggplot(aes(x = reorder(`Country Name`, -diff), y = diff)) +
  geom_bar(position = "dodge", stat = "identity", fill = "lightblue") +
  theme_classic() +
  ggtitle("Increase in Life Expectancy in Years (Period: 1962-2007)") +
  ylab("Years") +
  xlab("Country") +
  geom_text(aes(label = round(diff, 2)), position = position_dodge(width = 0.9), vjust = -0.25)

From the graph above, we see that the five countries with the greatest increase in life expectancy are Maldives, Bhutan, Timor-Leste, Tunisia, and Oman.

This answer is based on the absolute difference in life expectancy between year 2007 and year 1962.